Objectives: Predicting the length of stay (LOS) of patients in a hospital is important for providing them with better services and higher 
satisfaction, as well as for helping hospital management plan and manage hospital resources as meticulously as possible. We propose 
applying data mining techniques to extract useful knowledge and build an accurate model to predict the LOS of heart patients. 
Methods: Data were collected from patients with coronary artery disease (CAD). The records of 4,948 patients who had 
suffered CAD were included in the analysis. Three classification algorithms were used, namely, decision 
tree, support vector machines (SVM), and artificial neural network (ANN). LOS is the target variable, and 36 input 
variables are used for prediction. A confusion matrix was obtained to calculate sensitivity, specificity, and accuracy. Results: The 
overall accuracy of SVM was 96.4% in the training set. Most single patients (64.3%) had an LOS <5 days, whereas 41.2% of 
married patients had an LOS >10 days. Moreover, the study showed that comorbidity states, such as lung disorders and 
hemorrhage with drug consumption, have an impact on long LOS. The presence of comorbidities, an ejection fraction <2, being a 
current smoker, and having social security type insurance in coronary artery patients led to longer LOS than in other subjects. 
Conclusions: All three algorithms are able to predict LOS with various degrees of accuracy. The findings demonstrated that 
the SVM was the best fit. There was a significant tendency for LOS to be longer in patients with lung or respiratory disorders 
and high blood pressure. 
I. Introduction 

Coronary artery disease (CAD) is a major cause of disability 
in adults and a major cause of death in developed countries. 
Cardiovascular diseases are characterized 
by prolonged length of stay (LOS) [1]. LOS is defined as 
the number of days that a patient is hospitalized in a hospital 
or a similar medical facility. There has been considerable 
interest in controlling hospital costs, particularly in cardiac 
diseases; thus, hospitals try to make LOS as short as possible 
[2]. The length of hospital stay is an actual parameter applied 
to identify health care resource utilization, health cost, and 
severity of illness [3]. The use of LOS is highly predictive of 



Peyman Rezaei Hachesu et al. 

inpatient costs as a marker of resource utilization [4]. Hospitals 
have severely limited beds to hold inpatients, and as 
most of them are facing substantial financial pressure, it is 
extremely important to find ways to reduce health care costs 
[5]. One solution is to predict the discharge 
date and LOS of each patient using complementary 
techniques and technologies, such as data mining [6]. For a 
hospital administrator, predicting and evaluating LOS data 
is laborious but essential [7]. 
Precise prediction of LOS improves the efficiency of bed 
occupancy management in hospitals. Therefore, accurate 
prediction of LOS has become increasingly important 
for hospital management and health care systems [4]. Meanwhile, 
awareness of the factors that determine LOS 
could promote the development of efficient clinical pathways 
and optimize resource utilization and management [8]. In 
addition, many hospitals have no ability to predict and 
measure future admission requests. Successful 
prediction of discharge dates and duration of hospital stay 
allows the corresponding scheduling of elective admissions, 
leading to diminished variance in bed occupancy [9]. 
Developing models for predicting 
LOS in hospitals can be very useful for hospital 
management, particularly for prioritizing health care policies 
and promoting health services, including the appropriate 
allocation of health care resources according to differences in 
patients' LOS while considering patients' health status 
and socio-demographic features [3]. Better prediction 
models are needed to facilitate the decision-making process, 
and judgment alone cannot take their place. For these reasons, 
providing an efficient and accurate model to predict LOS for 
various types of diseases is one of the issues considered by 
researchers. However, there has been relatively little research 
related to LOS prediction. Therefore, we applied data mining 
techniques to extract useful knowledge and suggest a model 
to estimate length of stay for coronary artery patients in 
cardiovascular centers. 

1. Literature Review 

Studies on factors contributing to LOS have regularly ap- 
peared in the literature. One study conducted to determine 
the factors affecting LOS in public hospitals in Lorestan 
Province, Iran demonstrated that, first, an increase in age 
would lead to an increase in average LOS and, second, the 
average LOS of men is longer than that of women. The t-test, 
one-way ANOVA, and multifactor regression were used for 
the analysis. They did not provide any prediction model, 
because they focused on descriptive analysis based on tra- 
ditional statistical methods [10]. Rowan et al. [8] proposed 
and implemented a software package demonstrating that 
artificial neural networks (ANNs) could be used as an effec- 
tive LOS stratification instrument in postoperative cardiac 
patients. Blais et al. [11] designed a screening and rating tool 
to quantify variables related to LOS in a medical psychiatric 
unit. The findings from this study showed that 25 variables, 
including patient, illness, and treatment variables, were likely 
to be related to LOS. Tu and Guerriere [12] indicated that 
ANNs can be used as a predictive tool to identify patients at 
increased risk for prolonged intensive care unit LOS follow- 
ing cardiac surgery. They claimed that the back propagation 
algorithm had not previously been developed for this area. 
Lin et al. [13] explored the prediction of hospital stays for 
first-time stroke patients in a rehabilitation department by a 
proportional hazard regression (HR) model. They proposed 
using the HR model to predict the mean LOS of stroke pa- 
tients. Jiang et al. [5] studied the use of four data mining 
techniques (logistic regression, neural network, decision 
tree, and ensemble model) to analyze the inpatient discharge 
data for average LOS based on input variables. The findings 
from this research showed that the ensemble model was 
the best fit, and age and chronic disease were the important 
predictors. Misclassification and average squared error were 
used to assess the models. The ensemble model had the 
lowest average squared error (0.21), and the decision tree 
had the highest average squared error (0.22). Wrenn et al. 
[14] were able to predict LOS for an emergency department 
through developing and validating an ANN. The results were 
promising and showed that ANN can predict a patient's LOS 
within an average of <1.99 hours. Using a cohort of prospec- 
tively identified heart failure patients, Wright et al. [1] found 
that peripheral edema, chest pain, fatigue, serum albumin, 
serum sodium at admission and peak creatinine could result 
in hospital stays longer than six days. Blais et al. [11] studied 
factors that differentiated psychiatry patients' short LOS (7 
days or less) and long LOS (more than 14 days). Age, impair- 
ment level, and +6 independent functioning levels were all 
independent predictors of LOS. 

As previously mentioned, most of the research on LOS has 
been conducted in rehabilitation and psychiatric fields [15]. 
Most models in the cardiac disease area have predicted in- 
hospital mortality [16], and statistical methods, especially 
descriptive analyses, have been applied in that research. 
Hence, raising our awareness of factors that have an impact 
on cardiac patients' LOS is essential in order to determine 
and develop a useful and efficient model to predict LOS. This 
research aims to investigate the important factors that can be 
assessed to predict the LOS of patients with coronary artery 
diseases. 

2. Data Mining Algorithms 

Finding undiscovered information and useful patterns in a 
database is often referred to as data mining [17]. Data min- 
ing is heavily used in the health and medical field in applica- 
tions such as disease prediction and patient management 
[18]. Relationships, rules, and essential information about or 
from the data cannot be easily extracted because of database 
size and other features. We used some of the most common 
predictive data mining methods for our goals as follows. 

1) Artificial neural networks 

ANNs are used to perform multivariate analysis to identify 
both linear and non-linear patterns among data variables 
[19]. Due to their good predictive performance, ANNs are 
the most popular method in various areas of medicine [20] 
and lead to appropriate decisions. An ANN consists of many 
connected processing elements, including multiple input 
nodes and weighted interconnections. The radial-basis- 
function (RBF) ANN was developed to recognize CAD. 

2) Support vector machines (SVMs) 

The SVM is a classification method that has received increasing 
attention in recent years. It is a relatively new method for 
classification of both linear and non-linear data [21], and in 
terms of predictive accuracy, it is a powerful algorithm. In 
fact, an SVM is a linear learning machine constructed through 
an algorithm that uses an optimization criterion. We applied 
the RBF kernel because of its good general performance 
and because it has the smallest number of parameters [22]. 
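
As an illustration of the RBF kernel mentioned above, the following minimal sketch computes the standard form K(x, y) = exp(-gamma * ||x - y||^2) for two feature vectors. The function name and the default value of `gamma` are assumptions made for this example; the study does not report its kernel settings.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Return the RBF kernel value exp(-gamma * ||x - y||^2).

    `gamma` is an illustrative default, not a value taken from the study.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, which is what lets a linear learning machine separate non-linear data.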

3) Decision tree 

Generally, a decision tree as a visual and analytical decision 
support tool is a graphic representation of obtained knowl- 
edge in the form of a tree (flow chart like structure), where 
each non-leaf node denotes a test on an attribute, and each 
branch indicates an output of the test [23]. It uses a combi- 
nation of mathematical and computational techniques to aid 
description and classification, and to extract knowledge from 
a data set [24]. Because nodes and branches are organized 
hierarchically, decision trees are easy to understand and interpret. They 
are reported to be reliable and accurate in clinical 
decision-making [25]. C5.0 is among the most recent decision 
tree algorithms. The C5.0 algorithm with 10-fold cross 
validation and 20 boosting trials was applied in this 
research. 
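
The 10-fold cross validation mentioned above can be sketched as follows. This is a generic illustration of how records are partitioned into folds, not the procedure of the tool used in the study; the function name is hypothetical, and in practice the records would usually be shuffled before splitting.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every record is used for validation exactly once; fold sizes differ
    by at most one record when n_records is not divisible by k.
    """
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(n_records))
    # The first (n_records % k) folds receive one extra record.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```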
4) Ensemble models 

The ensemble method creates a new model by combining 
SVM, C5.0, and ANN models. 
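
The text does not state how the three models are combined. One common option is simple majority voting, sketched below under that assumption; the function name and the tie-breaking rule are hypothetical choices for this example.

```python
from collections import Counter


def majority_vote(predictions, tie_breaker=0):
    """Combine the class labels predicted by several models for one patient.

    `predictions` lists one label per model, in a fixed model order.
    When no label has a strict majority of votes over the others,
    the label of the model at position `tie_breaker` is returned.
    """
    counts = Counter(predictions)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    if len(winners) == 1:
        return winners[0]
    return predictions[tie_breaker]
```

For example, with predictions ordered as (SVM, C5.0, ANN), `majority_vote([3, 3, 1])` returns class 3, and a three-way disagreement falls back to the first model.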

II. Methods 

1. Patient Population 

The cohort consisted of hospitalized patients during the 
study period, which started on July 18, 2006 and ended on 
December 30, 2011. We identified 4,948 patients who were 
admitted to the Academic and Educational Hospital of Ra- 
jaei Cardiovascular Medical & Research Center in Tehran, 
Iran with heart disease-related diagnoses. Only CAD data 
were included in the study (n = 3,512). Significant CAD was 
defined as at least one point of 50% or greater diameter 
stenosis in at least one coronary artery vessel [26]. 

2. Data Set 

The data sets were stored in a Microsoft SQL Server (MS-SQL) 
database. 
We extracted and constructed a new data set for LOS of 
CAD. Of these patients, 246 were removed from the analyses 
because patient record data, such as identification, were 
unavailable in the data set. Thus, 3,266 patients were 
included in the final data set for further analysis. Table 1 shows 
the features with their accepted classes and values. The data set 
contained 36 attributes. Finally, we organized the data set into 
two groups: categorical and numerical features. 
We categorized data values and derived new fields from 
existing data in the following features: ejection fraction, dia- 
stolic blood pressure, systolic blood pressure, smoking, tri- 
glyceride, low-density lipoprotein, high-density lipoprotein, 
hemoglobin, serum cholesterol, and fasting blood sugar. 
These features were changed to categorical attributes for bet- 
ter analysis and to obtain good results. 
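
The conversion of numeric measurements into categorical codes can be sketched as below, using the systolic blood pressure and ejection fraction classes listed in Table 1. The function names are hypothetical. Two details are assumptions: boundary values that Table 1 leaves open (for example exactly 140 mmHg) are assigned to the higher class, and the fair and poor ejection fraction classes are coded 2 and 3 by analogy with the other features.

```python
def code_systolic_bp(mm_hg):
    """Map systolic blood pressure (mmHg) to the Table 1 class code.

    1 = hypotension (<90), 2 = desirable (90-119),
    3 = borderline hypertension (120-139), 4 = hypertension (140 and above).
    """
    if mm_hg < 90:
        return 1
    if mm_hg < 120:
        return 2
    if mm_hg < 140:
        return 3
    return 4


def code_ejection_fraction(percent):
    """Map ejection fraction (%) to a class code.

    1 = good (50 and above), 2 = fair (30-49), 3 = poor (<30).
    Codes 2 and 3 are assumed; values above 75 are treated as class 1.
    """
    if percent >= 50:
        return 1
    if percent >= 30:
        return 2
    return 3
```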

3. Data Pre-processing 

Data cleansing and preprocessing are essential to have op- 
timal results [27]. Therefore, we performed the following 
cleansing and preprocessing: repeated records, fields with 
spelling errors, additional tokens, other irregularities, and 
irrelevancies were deleted. The next step of pre-processing 
was the handling of patient records with missing and outlier 
data. 

4. Dealing with Missing Values 

The hospital data set had many features with missing values. 
Several replacement strategies were adopted to fill the missing 
values. First, if a feature was encountered in more than 50% 
Table 1. The demographic and clinical characteristics of the length of stay data set (n = 2,064) 

| Variable | Class and value |
|---|---|
| [illegible] | Numerical ([illegible]), mean ± SD ([illegible]) |
| Age (yr) | Numerical (15-94), mean ± SD (58 ± 3) |
| Serum creatinine (mg/dL) | Numerical (0.2-11.6), mean ± SD (1.2 ± 0.55) |
| Gender | 1, male; 0, female |
| Fasting blood sugar (mg/dL) | 1, 70-100; 2, 101-126; 3, >127 |
| Serum cholesterol (mg) | 1, <200; 2, 200-239; 3, >240 |
| Hemoglobin (gm/mL) | 1 (normal), >13.5 and <18 (men & age >17 yr) or >12 and <16 (women & age >17 yr) or >11 and <16 (age <17 yr); 2 (low level), <13.5 (men & age >17 yr) or <12 (women & age >17 yr) or <11 (age <17 yr); 3 (high level), >18 (men & age >17 yr) or >16 (women & age >17 yr) or >16 (age <17 yr) |
| High-density lipoprotein (mg/dL) | 1 (best), >60 yr; 2 (poor), men & <40 yr or women & <50 yr; 3 (better level), men & 40-59 yr or women & 50-59 yr |
| Low-density lipoprotein (mg/dL) | 1 (optimal), <100; 2 (near optimal), 100-129; 3 (borderline high), 130-159; 4 (high), 160-189; 5 (very high), >190 |
| Triglyceride (mg/dL) | 1, <150; 2, 150-199; 3, 200-499; 4, >500 |
| Marital status | 0, single; 1, married |
| Diabetes mellitus | 1, history of diabetes; 0, no such history |
| Hypertension | 0, no; 1, yes |
| Family history of coronary disease | 0, no; 1, yes |
| Past history of heart disease | 0, no; 1, yes |
| Dyslipidemia | 0, no; 1, yes |
| Smoker or not | 2, current; 3, past; 4, recent; 5, never |
| Ejection fraction | 1 (good), 50-75; 2 (fair), 30-49; 3 (poor), <30 |
| Chest pain | 2, yes; 3, no |
| Systolic blood pressure (mmHg) | 1 (hypotension), <90; 2 (desirable), 90-119; 3 (borderline hypertension), 120-139; 4 (hypertension), >140 |
| Diastolic blood pressure (mmHg) | 1 (hypotension), <60; 2 (desirable), 61-79; 3 (borderline hypertension), 80-89; 4 (hypertension), >90 |
| Exercise stress test | 0, normal; 1, abnormal |
| Absence or presence of one or more disorders as well as a primary disease | 0, no; 1, yes |
| Valvular heart disease | 0, no; 1, yes |
| ST segment and T wave of electrocardiogram changes | 0, normal; 1, having ST-T wave abnormality |
| Coronary artery disease diagnosed by physicians (diagnosis) | 0, no; 1, yes |
| Drug category (a) | 0, not used; 1, used |
| Type of medical insurance used by the patient | 1, medical services insurance; 2, insured rural; 5, social security |
| Length of stay (LOS, day) | 1, if LOS > 0 and LOS < 5; 2, LOS between 6-9; 3, LOS > 10 |

SD: standard deviation. 

(a) Statin, nitrates, inotropic, diuretic, calcium channel blocker, beta blocker, antiplatelet, anticoagulant, angiotensin-converting-enzyme inhibitor. 
Table 2. Attributes with missing data and replacement values 

| Attribute | Missing data (%) | Method; alternative value |
|---|---|---|
| Age | 0 | |
| Sex | 0 | |
| Marital status | 0 | |
| Exercise stress test | 0 | |
| Diabetes | 0 | |
| Hypertension | 0 | |
| Dyslipidemia | 0 | |
| Family history | 0 | |
| Comorbidity | 5.24 | Mode; 0 |
| Ejection fraction | 4.36 | Mode; 1 |
| Diagnosis | 2.17 | Mode; 1 |
| Hemoglobin | 8.86 | Mode; 2 (for class 1); Mode; 2 (for class 2); Mode; 2 (for class 3) |
| Creatinine | 11.2 | Mean; 1.90 (for class 1); Mean; 1.25 (for class 2); Mean; 1.18 (for class 3) |




of records with missing values, it was judged not to be an 
effective feature for the analysis and was removed; weight and 
job are examples. Second, 
if a numeric feature had missing values in less than 12% of 
records, the missing values were replaced with the mean of the 
observed values. For example, 
creatinine showed 11.2% missing data, and the mean value of 
the corresponding class was used (Table 2). If the feature 
was nominal or ordinal, the mode was used instead. 
The comorbidity, ejection fraction, hemoglobin, and diagnosis 
features followed this rule (Table 2). 
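
The mean and mode replacement described above can be sketched as follows. This is a minimal illustration, assuming that records are dictionaries, that missing values are stored as `None`, and that the replacement is computed within each class as in Table 2; the function name and the `los_class` key are hypothetical.

```python
from collections import Counter, defaultdict


def impute_by_class(records, feature, class_key="los_class", numeric=True):
    """Return copies of `records` with missing `feature` values filled in.

    Numeric features receive the mean of the observed values in the same
    class; nominal or ordinal features receive the mode. A class with no
    observed value for the feature raises KeyError.
    """
    observed = defaultdict(list)
    for record in records:
        if record[feature] is not None:
            observed[record[class_key]].append(record[feature])

    fill = {}
    for cls, values in observed.items():
        if numeric:
            fill[cls] = sum(values) / len(values)
        else:
            fill[cls] = Counter(values).most_common(1)[0][0]

    return [
        {**record, feature: fill[record[class_key]]}
        if record[feature] is None
        else dict(record)
        for record in records
    ]
```

The input list is left unchanged, so the original and imputed data sets can be compared afterwards.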

In the third strategy, the C5.0 algorithm was applied to 
those features showing missing values in more than 10% of 
records. We filled the missing values of these features with the 
values predicted by this algorithm; the accuracy achieved for each feature is given in Table 3. 

To resolve outliers in each feature, we exported the data 
to Microsoft Excel format and detected clear outliers 
using methods such as sorting. Otherwise, the nearest 
acceptable non-outlier value for that feature was used to 
replace the outlier value [28]. 
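
One way to read the replacement rule above is sketched below: values outside an acceptable range are replaced by the nearest observed value inside that range. The function name and the idea of passing explicit `lower` and `upper` limits are assumptions for this example; the text does not say how the acceptable range of each feature was set.

```python
def replace_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest in-range value.

    Values below `lower` become the smallest observed in-range value and
    values above `upper` become the largest observed in-range value.
    """
    acceptable = [v for v in values if lower <= v <= upper]
    if not acceptable:
        raise ValueError("no value falls inside the acceptable range")
    nearest_low, nearest_high = min(acceptable), max(acceptable)
    return [
        nearest_low if v < lower else nearest_high if v > upper else v
        for v in values
    ]
```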

5. Attribute Coding 

Data were coded according to recognized resources, such as heart 
disease associations and the Wikipedia website. The scaling and 
coding of features are given in Table 1. 
Table 3. Features with missing data values and accuracy results 

| Feature | Missing data (%) | Accuracy result (%) |
|---|---|---|
| High-density lipoprotein | 32.7 | 96.8 |
| Low-density lipoprotein | 16.6 | 94.2 |
| Pulse rate | 25.4 | Mean error 0.7 (a) |
| Systolic blood pressure | 24.6 | 97.9 |
| Diastolic blood pressure | 24.6 | 96.9 |
| Chest pain | 21.1 | 97.6 |
| Cholesterol | 19.7 | 97.9 |
| Fasting blood sugar | 18.1 | 91.7 |
| Triglyceride | 16.9 | 95.8 |
| Smoking | 26.6 | 97.2 |

(a) Calculated by linear regression. 



6. Training and Test Data Sets 

After cleaning and preprocessing, 2,064 completed records 
were extracted and obtained for data mining tasks. Separat- 
ing the data into training and testing sets is an important 
part of evaluating data mining models. We partitioned the 
data set into a training set and a testing set; about 80% of the data 
(1,643 records) was used for training, and about 20% of the data 
(421 records) was used for testing. The training set was used 
to adjust the parameters of the models, and the testing set 
was used to evaluate their predictive ability. 
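
The partition described above can be sketched as a shuffled split with a fixed number of training records. The function name and the seed are assumptions for this example; the text does not say whether the records were assigned randomly.

```python
import random


def train_test_split(records, n_train, seed=0):
    """Shuffle `records` reproducibly and split them into two lists.

    The first `n_train` shuffled records form the training set and the
    remainder form the testing set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With 2,064 records and `n_train=1643`, the testing set holds the remaining 421 records, matching the counts reported above.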

III. Results 

1. Statistical Analysis 

The mean age of 2,064 patients was 58.2 ± 13.0 years (aged 
15-94) with most subjects between 54-64 years old. The 
sample was composed of 1,266 (61.3%) men and 798 (38.6%) 
women. 1,264 (61.2%) patients were diagnosed with CAD 
and 800 (38.8%) without CAD. LOS class 3 included most 
patients (39.3%); LOS class 1 and LOS class 2 comprised 
35.8% and 24.9%, respectively. Table 4 demonstrates statisti- 
cal results of the data set. 

These data mining models were developed by a data min- 
ing classification tool. We evaluated the model created using 
training data and then applied test data to compare the re- 
sults. We used SPSS Clementine 12 (SPSS Inc., Chicago, IL, 
USA) to build mining models. 

The performance of a diagnostic method is usually evaluated 
in terms of classification accuracy, sensitivity, and 
specificity. Accuracy is the percentage of correct decisions, 
that is, cases in which CAD is predicted when the test is true and non-CAD is 
predicted when the test is false [29]. Sensitivity is the true 
positive rate, and specificity is the true negative rate of CAD. 
Table 5 shows the sensitivity, specificity, and accuracy of 



Table 4. Statistical results for some important features 

| Variable | Value | Proportion (%) | LOS 1 | LOS 2 | LOS 3 |
|---|---|---|---|---|---|
| Marital status | 0 | 8.6 | 63 | 38 | 77 |
| | 1 | 91.4 | 675 | 476 | 735 |
| Past history of CAD | 0 | 72.4 | 543 | 356 | 595 |
| | 1 | 27.6 | 195 | 158 | 217 |
| Diabetes | 0 | 71.5 | 373 | 199 | 228 |
| | 1 | 28.5 | 365 | 315 | 584 |
| Diagnosis | 0 | 38.8 | 373 | 199 | 228 |
| | 1 | 61.2 | 365 | 315 | 584 |
| Insurance | 5 | 46.5 | 341 | 255 | 364 |
| | 1 | 13.5 | 11 | 66 | 101 |
| | 2 | 12.1 | 75 | 54 | 121 |
| Smoking | 2 | 33.1 | 305 | 176 | 203 |
| | 3 | 19.7 | 153 | 99 | 154 |
| | 4 | 4.6 | 25 | 31 | 50 |
| | 5 | 42.6 | 255 | 208 | 405 |
| Comorbidity | 0 | 58.2 | 513 | 297 | 391 |
| | 1 | 41.8 | 225 | 217 | 421 |
| Fasting blood sugar | 1 | 42.3 | 322 | 204 | 348 |
| | 2 | 31.1 | 242 | 167 | 232 |
| | 3 | 26.6 | 174 | 143 | 232 |
| Diastolic blood pressure | 2 | 37.9 | 275 | 189 | 319 |
| | 3 | 53.2 | 396 | 277 | 426 |
| | 4 | 7.9 | 59 | 44 | 62 |
| Ejection fraction | 1 | 25.3 | 250 | 144 | 128 |
| | 2 | 74.4 | 483 | 369 | 68 |
| Chest pain | 2 | 74.9 | 585 | 386 | 576 |
| | 3 | 25.1 | 153 | 128 | 236 |
| Hemoglobin | 1 | 25.3 | 250 | 144 | 128 |
| | 2 | 74.4 | 483 | 369 | 684 |

LOS: length of stay, CAD: coronary artery disease. 



Table 5. Analysis of length of stay data set with classification 
techniques 

| Algorithm | Accuracy (%) | Specificity (%) | Sensitivity (%) |
|---|---|---|---|
| Decision tree (C5.0) | 83.5 | 65.2 | 97.1 |
| Neural network | 53.9 | 65.1 | 72.2 |
| Support vector machine | 96.4 | 97.3 | 98.1 |
| Ensemble algorithm | 95.9 | 93.4 | 98.2 |

different classification techniques. A confusion matrix was 
obtained to calculate sensitivity, specificity, and accuracy. 
The overall accuracy of SVM was 96.4% in the training set. 
The ensemble algorithm showed a stronger performance 
than other algorithms with a sensitivity of 98.2%. 
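
The calculation of sensitivity, specificity, and accuracy from a confusion matrix can be sketched as below. Because LOS has three classes, the sketch treats one class as positive and the other two as negative (one-vs-rest); this is an assumption, since the text does not say how the three-class results were reduced to single figures. The function name and the matrix layout are hypothetical.

```python
def one_vs_rest_metrics(confusion, positive):
    """Compute sensitivity, specificity, and accuracy for one class.

    `confusion[actual][predicted]` holds record counts, with classes
    indexed from 0. The class at index `positive` is treated as positive
    and all other classes as negative.
    """
    n_classes = len(confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive]) - tp
    fp = sum(confusion[actual][positive] for actual in range(n_classes)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / total,
    }
```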

2. Important Features 

The relative importance of each variable in evaluating the 
model is associated with the importance of each feature in 
making a prediction, and it does not relate to the model ac- 
curacy [30]. Also, the sum of the values for all variables in 
algorithms is 1.0. The SVM model, with earlier parameter 
setting, was used to extract important factors. In Table 6, 
features with great impact on CAD are listed in order of 
variable importance. The most significant variables were 
drug categories, such as nitrates and anticoagulants as well 
as CAD diagnosis. Comorbidity is also a strong predictor of 
prolonged LOS. Sex was significant in predicting LOS since 
men had longer LOS than women. Age played a notable role 
as well, since analysis revealed that patients aged <50 and 
>80 had increased mean LOS. Most single patients (64.3%) 
were in LOS class 1, compared with 24.5% of married 
patients; 41.2% of married patients were in LOS class 3 and 
34.3% were in LOS class 2. Furthermore, insurance 
type had predictive power. Patients with social security 
and rural medical insurance were in LOS class 3. Thus, the 
most notable factors influencing LOS obtained by algorithm 



Table 6. Important features extracted by support vector machine 
model 

| Features | Relative weight |
|---|---|
| Anticoagulant drugs | 0.1824 |
| Nitrate drugs | 0.1033 |
| Diagnosis | 0.9200 |
| Diastolic blood pressure | 0.7870 |
| Ejection fraction | 0.1280 |
| Comorbidity | 0.0586 |
| Marital status | 0.0424 |
| Chest pain | 0.0272 |
| Sex | 0.0340 |
| High-density lipoprotein | 0.0310 |
| Hemoglobin | 0.0870 |
| Smoking | 0.0780 |
| Insurance type | 0.0350 |
| Cholesterol | 0.0460 |
| Age | 0.2130 |
| ST-T change | 0.0320 |

Table 7. Important rules extracted using the C5.0 algorithm 

| No. | Antecedent | Consequent |
|---|---|---|
| 1 | If diagnosis = 1.0, comorbidity = 1.0, Hgb = 2.0, smoking in [2.0, 3.0], triglyceride in [1.0 and 2.0] (a) | then LOS = 3 |
| 2 | If Hgb = 2.0, creatinine < 1.80, diagnosis = 1.0, comorbidity = 1.0, smoking in [2.0] and EF < 2.0 | then LOS = 3 |
| 3 | If Hgb = [illegible], EF [illegible], smoking in [2.0, 3.0, 4.0] and chest pain = [illegible] | then LOS = 1 |
| 4 | If comorbidity = 0.0 and smoking = 5.0 | then LOS = 2 |
| 5 | If diagnosis = 0.0, chest pain = 2.0, and EF in [2.0] | then LOS = 3 |
| 6 | If diagnosis = 0.0 and EF = 1.0 and triglyceride = 2.0 and creatinine >1.1 and comorbidity = 0.0 and insurance = 5.0 | then LOS = 1 |
| 7 | If EF = 2.0 and ST = 1.0 and smoking = 2.0 and comorbidity = 0.0 and insurance = 9.0 | then LOS = 2 |
| 8 | If diagnosis = 1.0 and diastolic BP = 3.0 and EF = 1.0 and FBS = 2.0 and triglyceride = 3.0 and chest pain = 2.0 and insurance = 5.0 | then LOS = 3 |
| 9 | If diagnosis = 0.0 and EF = 2.0 and triglyceride = 3.0 and creatinine < 1.1 and past history = 0.0 and smoking in [2.0, 3.0, 5.0] and marital = 1.0 | then LOS = 1 |
| 10 | If diagnosis = 1.0 and diastolic BP = 3.0 and EF = 1.0 and FBS = 2.0 and triglyceride = 3.0 and chest pain = 2.0 and insurance = 5.0 | then LOS = 3 |
| 11 | If comorbidity = 0.0 and insurance = 1.0 and marital = 0.0 | then LOS = 1 |

BP: blood pressure, EF: ejection fraction, Hgb: hemoglobin, FBS: fasting blood sugar, ST-T: ST segment and T wave of electrocardiogram changes. 

(a) Values are according to Table 1 (1, if LOS > 0 and LOS < 5; 2, LOS between 6-9; 3, LOS > 10). 



are drugs (nitrates), being diagnosed with CAD, comorbid- 
ity, hemoglobin, ejection fraction, insurance type, smoking, 
family history, and sex. These factors were extracted with 
99.1% accuracy by the SVM. 

3. Extract Rules 

This section describes how these significant rules were 
extracted. Based on the C5.0 algorithm with the previously 
mentioned parameter setting, 11 rules, expressed as 
IF (antecedent) and THEN (consequent) in 
Table 7, were generated with a mean estimated accuracy 
of 95.3%. In coronary artery patients, the presence of 
comorbidities such as lung and digestive disorders, ejection 
fraction <2, currently being a smoker, and insurance type = 5 
were associated with a longer LOS than in other 
subjects. Absence of comorbidities, being a nonsmoker, being 
single, not having chest pains, and having medical service 
insurance were associated with shorter LOS. The 
extracted rules are listed in Table 7. However, more 
investigation with more features and larger data sets is still 
required. 
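
To show how IF-THEN rules of this kind can be applied to a patient record, the sketch below encodes rules 4 and 11 from Table 7 as data and returns the LOS class of the first matching rule. The field names, the rule order, and the first-match policy are assumptions for this example; the text does not describe how overlapping rules are resolved.

```python
# Each rule: (label, {field: set of allowed coded values}, predicted LOS class).
# Rules 11 and 4 are taken from Table 7; their order here is an assumption.
RULES = [
    ("rule 11", {"comorbidity": {0}, "insurance": {1}, "marital": {0}}, 1),
    ("rule 4", {"comorbidity": {0}, "smoking": {5}}, 2),
]


def predict_los_class(patient, rules=RULES):
    """Return the LOS class of the first rule whose conditions all hold.

    `patient` maps field names to coded values as in Table 1.
    Returns None when no rule matches.
    """
    for _label, conditions, los_class in rules:
        if all(patient.get(field) in allowed for field, allowed in conditions.items()):
            return los_class
    return None
```

For instance, a single patient without comorbidities who holds medical services insurance matches rule 11 and is assigned LOS class 1.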

IV. Discussion 

This study investigated the determinants of length of 
hospital stay in patients with CAD admitted to a 



cardiovascular center. Many studies of length of hospital stay 
predict the duration of stay based on laboratory parameters 
or other quantifiable variables [1]. Our findings showed that 
a LOS greater than 10 days was associated with comorbidity 
and diastolic blood pressure features. There was a significant 
tendency for LOS to be longer in patients with lung or respi- 
ratory disorders and high blood pressure. Hence, comorbidi- 
ties such as lung disorders and hemorrhage have an impact 
on long LOS and are important features in predicting LOS. 
However, Appelros [31] demonstrated that comorbidities 
do not significantly influence LOS, while smoking 
has an inverse effect on acute LOS. He also reported that 
stroke severity is an important predictor of both acute and 
total LOS. Some studies have reported that patient 
demographics and hospital attributes were the two major factors 
that contributed to identifying patient LOS [3], and the most 
useful patient feature for predicting LOS was patient's age 
[32]. In many studies, age has been found to be a very sig- 
nificant predictor of LOS [5]. Our results from the retrospec- 
tive study of LOS replicate a number of previously reported 
findings. 

In this study, the extracted rules demonstrated that patients 
with normal levels of hemoglobin, medical services insurance, 
and ejection fraction class 1 (good level, 50-75), as well as 
those with no past history of cardiac disease or comorbidities 
and non-smokers, had a normal LOS in hospital. We 
found that married men had longer LOS than single men 
and women. Patients who were using statins and blood 
anticoagulant drugs had a prolonged LOS. These findings did 
not conform to those of another study [2]. 

Our results indicate that both the SVM and an ensemble 
model combining the three data mining algorithms can 
predict a patient's LOS, but the SVM is the best fit. We 
suggest that the SVM model is optimal for predicting mean LOS 
of CAD patients. According to other studies, SVM has the highest 
forecasting accuracy among data mining algorithms. 
Today, this algorithm is becoming increasingly common in 
the medical and health field [33]. 

In addition to disease-related factors, LOS may be affected 
by factors unrelated to the disease, such as availability of 
hospital beds and rehabilitation facilities as well as discharge 
possibilities [31]. Due to the lack of medical facilities, staff 
shortages, and the increasing cost of health care, it is ex- 
tremely important to optimize LOS and to identify factors 
affecting it. Note that LOS is influenced by individual charac- 
teristics such as weight, disease status, patient management 
style, hospital management, organizational characteristics, 
and other features [34,35]. 

This study was confined to the exploration of length of 
hospital stay of CAD patients with no consideration of other 
factors, such as the ethnic and socio-cultural environment 
of each patient and admission status. Health care data is 
generally not structured, and it is distributed over various 
locations. In terms of practicality, the primary limitation of 
this study is that all data were obtained from a specialized 
cardiovascular center. The factors selected for LOS predic- 
tion tended to be less social and more condition specific (e.g., 
presence of cardiac coronary artery bypass graft). Some im- 
portant variables could not be considered, including factors 
such as alcohol consumption, other comorbidities, distance 
between patients' place of residence and the hospital, and 
admission type (elective and urgent). However, we attempted 
to identify the primary factors related to longer LOS. These 
data should also be collected uniformly to increase predic- 
tion accuracy. 